A bank manager is concerned that more and more customers are leaving the bank's credit card services. The bank would like us to predict which customers are likely to churn so that it can proactively approach them, offer better service, and change their minds. The dataset comes from https://leaps.analyttica.com/home
Which customers are more likely to leave the bank?
This project proceeds in phases. The first is an exploratory data analysis, whose objective is to understand the nature of the variables and to identify attributes that are strongly related to leaving the credit card service. The next phase applies machine learning algorithms to find the most informative features for building the model. The end result is a machine learning model that predicts, from a customer's data, whether that customer will cancel the credit card service.
We have 10,127 customers, of whom 8,500 are existing customers and 1,627 are attrited customers. That is 83.9% existing versus 16.1% attrited. Because only about 16% of the customers have churned, it is somewhat difficult to train a model to predict churn: the churned class is small compared with the total number of customers.
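The class shares can be checked directly from the two counts quoted above. This is a minimal sketch that uses only those counts; the variable names are illustrative and do not come from the notebook.

```python
# Class counts as stated in the text above.
existing = 8500
attrited = 1627
total = existing + attrited

# Share of each class, in percent.
existing_pct = 100 * existing / total
attrited_pct = 100 * attrited / total

# How many existing customers there are for each attrited customer.
imbalance_ratio = existing / attrited

print(f"Total customers: {total}")
print(f"Existing: {existing_pct:.1f}%")
print(f"Attrited: {attrited_pct:.1f}%")
print(f"Existing customers per attrited customer: {imbalance_ratio:.2f}")
```

On the real data, the same shares could be read off the target column once the CSV is loaded, for example with a normalized value count.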
The data has 20 features. Of these, 6 are categorical variables (including the target, which is the attrition flag), 13 are continuous variables, and one is a ratio. The data includes demographic information as well as information about the relationship the bank currently has with each customer. This information would be of particular use to the customer retention department, whose success is measured by the number of customers it retains. The bank currently has limited solutions in place to predict whether a customer will churn, and our goal is to predict which customers are more likely to leave the bank. The initial success of the model will be assessed by its accuracy on a train/test split, but ultimately the model's success is measured by how accurately it predicts whether a customer will leave the bank.
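Because churned customers make up only about 16% of the data, accuracy by itself can be misleading. As a rough sanity check, consider a hypothetical baseline that labels every customer as existing. The sketch below is built only from the class counts quoted earlier and is not a model from this project; the label encoding (1 = attrited) is an assumption made for illustration.

```python
# Labels reconstructed from the stated class counts (1 = attrited, 0 = existing).
y_true = [0] * 8500 + [1] * 1627

# Hypothetical baseline: predict "existing" for every customer.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the attrited class: fraction of churners the baseline catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_positives / (true_positives + false_negatives)

print(f"Baseline accuracy: {accuracy:.3f}")
print(f"Baseline recall (attrited): {recall:.3f}")
```

The baseline reaches roughly 84% accuracy while identifying no churners at all, which is one reason to read accuracy alongside recall and precision for the attrited class.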
import pandas as pd
import numpy as np
import statsmodels.api as sm
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from sklearn.metrics import confusion_matrix, classification_report
import sklearn.metrics as metrics
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import neighbors
from math import sqrt
# Perform exploratory data analysis (EDA) in a single line of code
# Note: newer releases of this package are published as ydata-profiling
import pandas_profiling
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc, log_loss
from sklearn import utils
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
#import warnings
#import numpy as np
#warnings.simplefilter(action='ignore', category=FutureWarning)
#print('x' in np.arange(5)) #returns False, without Warning
#Makes the notebook full width for preference
#from IPython.core.display import display, HTML
#display(HTML("<style>.container { width:100% !important; }</style>"))
df = pd.read_csv('BankChurners.csv')
df.shape
df
The objective of this step is to dig into the data and discover the main factors that contribute to bank customers cancelling their credit card service. For this step we will use pandas profiling to derive statistical information about each variable (descriptive analysis and visualization) and to check the correlation between features, which will provide important insights for the rest of the analysis in this project.
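To make the correlation check concrete, here is a minimal sketch of ranking numeric features by their correlation with a binary churn indicator. The rows are made-up toy values, not data from `BankChurners.csv`; the label strings and the `churned` helper column are assumptions for illustration, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy rows with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Attrition_Flag": [
        "Existing Customer", "Existing Customer", "Existing Customer",
        "Attrited Customer", "Attrited Customer", "Existing Customer",
    ],
    "Months_Inactive_12_mon": [1, 2, 1, 3, 4, 2],
    "Contacts_Count_12_mon": [2, 1, 3, 4, 3, 2],
    "Dependent_count": [2, 3, 1, 2, 3, 2],
})

# Encode the target as 0/1 so it can take part in the correlation matrix
# (the label text "Attrited Customer" is assumed here).
toy["churned"] = (toy["Attrition_Flag"] == "Attrited Customer").astype(int)

# Pearson correlation of each numeric feature with the churn indicator.
corr_with_target = (
    toy.drop(columns=["Attrition_Flag"])
    .corr()["churned"]
    .drop("churned")
    .sort_values(ascending=False)
)

print(corr_with_target)
```

The profiling report produces a similar correlation view for the full dataset; this sketch only shows the underlying idea on a handful of rows.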
#Deleting unnecessary columns
del df['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1']
del df['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2']
df.head()
# Copy the dataframe (plain assignment would only create a second name for the same object)
df1 = df.copy()
df.info()
df.shape
pandas_profiling.ProfileReport(df, title = 'Credit Card Churn', html={'style':{'full_width':True}})